Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use less CPU and RAM #549

Merged
merged 4 commits into from
Mar 11, 2023
Merged

Use less CPU and RAM #549

merged 4 commits into from
Mar 11, 2023

Conversation

daladim
Copy link
Contributor

@daladim daladim commented Feb 7, 2023

Refs #537 .

This PR:

  • uses img2pdf instead of imagemagick to create PDFs. This saves much CPU and RAM usage, because this merely "staples together" various JPG files, whereas imagemagick would load every JPG into memory, do some kind of transcoding then writes
    them to a PDF file.
  • when image transcoding/compression is needed, it does this file by file, rather than "all files at once". This avoids imagemagick trying to load all of them to memory before it converts them.

This makes some pipelines slightly faster to process, but this improves RAM usage a lot for most of them when multiple pages are scanned. On older Raspberry Pis, this fixes the "pipeline has stalled because RAM is full" issue.

Disclaimers:

  • I'm not a JS developer, maybe there's a more idiomatic way to add a final newline in the first commit
  • I have tested with my scanner and my Raspberry Pi. In case you have more scanners and more devices to test, feel free to do so :)

This only refs #537 and does not fix it, because there's still the "transcode every page when the next page is being scanned, instead of transcoding all of them at the end" improvement that could be done one day.

Thanks

img2pdf only "staples together" various JPG files, whereas imagemagick
loads every JPG into memory, does some kind of transcoding _then_ writes
them to a PDF file.
This change improves CPU and RAM usage when scanning to PDF files.

Refs sbs20#537
@sbs20
Copy link
Owner

sbs20 commented Feb 15, 2023

To save me digging through and testing, why does there have to be an additional new line added at the end? Is it the while read or xargs?

@daladim
Copy link
Contributor Author

daladim commented Feb 15, 2023

I'm not sure about xargs. But read only splits strings at newlines, so it ignores the final line that does not finish with a LF.

There are tricks to also handle the final line in a loop, such as while read line || [ -n "$line" ]; do ...; done, but well, it feels that's better just to add a final LF :)

@sbs20
Copy link
Owner

sbs20 commented Feb 16, 2023

That link was fascinating, thank you. Let me ponder it further.

@sbs20
Copy link
Owner

sbs20 commented Feb 28, 2023

First - thank you so much for taking the time to do this and raise such a well documented PR.

Second, sorry it's taken until now to reply.

I have given this a LOT of thought. And I have decided not to merge the PR. The main problem for me is the additional dependencies of img2pdf. There are users who really don't want additional dependencies and even go so far as to edit the Dockerfile to reduce the image size. It should not stop you from running your code - although you will need to adapt your config.local.js to set your pipelines.

What I have done is raise #569 which will make it so your pipelines will work easily.

What I am also going to do is see about merging in the while read filename... code - so I am not yet closing this PR as I want to cherrypick.

@daladim
Copy link
Contributor Author

daladim commented Feb 28, 2023

Thanks for your message. img2pdf indeed is quite a large dependency, especially as it pulls the whole Python runtime. I understand that you are wary including it.

  • As you said, it does not prevent me from using it, I'll just have to include my own pipelines in config.local.js.
    I am using Docker though. I had tried to make an extended image using a Dockerfile that reads FROM sbs20/scanservjs:... ; RUN apt-get update && apt-get install -y img2pdf. However, when running docker build, apt-get fails for some reason (I cannot reproduce as I cannot access my raspberry pi right now). Something about missing permissions to perform whatever late installation step. Does this ring a bell to you? Do you know how I could still install img2pdf in an extended Docker image, without building the whole image from scratch?
  • Not upstreaming these changes is too bad after all, because it makes scanservjs not work out-of-the-box for raspberry pi users that have a scanner with ADF. That's too bad because scanservjs is, by far, the best (and only?) solution I found to scan on a headless raspberry pi.
    Img2pdf documentation, in its last paragraph, suggests tesseract is able to create PDFs without re-encoding. That would be great because it already is in the Docker image, but I could not make it work without having it insist on doing any OCR. Do you think it is possible though? You're probably more familiar with tesseract as I am.
    Otherwise, do you think you would be OK to either
    • publish two images (say, the "regular" scanservjs:$version image, plus an image tagged scanservjs:$version-with-extras)?
    • Even without img2pdf, there still is the "file-by-file conversion" from this PR that does not require any extra dependency, that seems useful for lower-end machines (and that, I think, does not cause any harm on more up-to-date machines).
    • At least write instructions in documentation files for raspberry pi users?

Out of curiosity, what machine are you using scanservjs on?
Thanks for your feedback!

@sbs20
Copy link
Owner

sbs20 commented Feb 28, 2023

Something about missing permissions to perform whatever late installation step. Does this ring a bell to you?

It all depends on what the ... is in FROM sbs20/scanservjs:.... If you're using this then it switches user and you won't have sufficient privs. You should be using this instead. If it's not that, then I don't know I'm afraid.

But I could not make it work without having it insist on doing any OCR. Do you think it is possible though?

No idea I'm afraid - I am actually fairly unfamiliar with Tesseract, sorry!

publish two images (say, the "regular" scanservjs:$version image, plus an image tagged scanservjs:$version-with-extras)

I would gladly link to someone else's image. I am less keen to add and maintain yet another image. I had never intended to support Docker - it adds a lot of support overhead, but somehow did anyway; this project is already placing heavy demands on my time.

there still is the "file-by-file conversion" from this PR that does not require any extra dependency

Yep - and I may yet include that, I just couldn't find the time to get it done and fully tested today. That's what I meant above with What I am also going to do is see about merging in the while read filename... code. I wanted to make the change in #569 which would at least mean you could use standard configuration changes.

At least write instructions in documentation files for raspberry pi users

Indeed. I do always try to do this - but I would welcome contributions.

@sbs20 sbs20 changed the base branch from master to iss537 March 11, 2023 17:14
@sbs20 sbs20 merged commit 26dce7b into sbs20:iss537 Mar 11, 2023
sbs20 added a commit that referenced this pull request Mar 11, 2023
Related to #537

This has cherry picked changes from #549 - essentially to change from using `convert @-` to a bash while loop.
sbs20 added a commit that referenced this pull request Mar 12, 2023
Related to #537

This has cherry picked changes from #549 - essentially to change from using `convert @-` to a bash while loop.
sbs20 added a commit that referenced this pull request May 7, 2023
Related to #537

This has cherry picked changes from #549 - essentially to change from using `convert @-` to a bash while loop.
sbs20 added a commit that referenced this pull request May 7, 2023
Related to #537

This has cherry picked changes from #549 - essentially to change from using `convert @-` to a bash while loop.
sbs20 added a commit that referenced this pull request Sep 27, 2023
Related to #537

This has cherry picked changes from #549 - essentially to change from using `convert @-` to a bash while loop.
@sbs20 sbs20 mentioned this pull request Sep 27, 2023
sbs20 added a commit that referenced this pull request Sep 27, 2023
Related to #537

This has cherry picked changes from #549 - essentially to change from using `convert @-` to a bash while loop.
sbs20 added a commit that referenced this pull request Oct 13, 2023
Follow on to correct #549
sbs20 added a commit that referenced this pull request Oct 16, 2023
Follow on to correct #549
sbs20 added a commit that referenced this pull request Oct 16, 2023
Follow on to correct #549
@sbs20 sbs20 mentioned this pull request Oct 16, 2023
sbs20 added a commit that referenced this pull request Oct 16, 2023
Follow on to correct #549
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants